Productivity Assessment of Neural Code Completion (+ Copilot)
Key points from the summary:
- Copilot looks for opportunities to offer a completion. Often, these opportunities do not lead to a completion being shown (e.g. because the user keeps typing).
- Acceptance rate is lower during standard working hours. There are some theories as to why, but they are unconfirmed.

Key caveats of the paper:
- Correlation is not causation
- The amount of variance explained by these trends is quite low
- Self-perceptions of productivity don’t necessarily reflect objective productivity
- Acceptance rate is not the only metric optimized for
How does Copilot decide the length of a completion?
- There’s a tradeoff between single-line and multi-line completions; Copilot tries to hit the sweet spot
- They are running experiments to find Copilot’s weak points (e.g. try/catch statements)
- They have not tested this on other tasks like writing docstrings
- Usually, they use the same stopping criteria as in GPT (like enforcing a max length or looking for double newlines), and then do post-hoc trimming of the end
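To make the stop-and-trim idea concrete, here is a minimal sketch; the specific criteria and the 500-character cap are illustrative assumptions, not Copilot’s actual rules:

```python
def trim_completion(completion: str, max_chars: int = 500) -> str:
    """Illustrative post-hoc trimming of a raw model completion.

    Assumed stopping criteria (not Copilot's actual rules):
    - cut at the first double newline, which often marks the end
      of a logical block;
    - enforce a hard maximum length;
    - drop a trailing partial line so the suggestion ends cleanly.
    """
    # Stop at the first double newline, if any.
    cut = completion.find("\n\n")
    if cut != -1:
        completion = completion[:cut]

    # Enforce a maximum length.
    completion = completion[:max_chars]

    # Trim a trailing incomplete line off the end.
    if "\n" in completion and not completion.endswith("\n"):
        completion = completion[: completion.rfind("\n") + 1]

    return completion.rstrip("\n")
```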
How did Copilot decide what size of model to use?
- Largely based on intuition a few years ago, though with new advances they might reevaluate
- They need to keep latency around ~200 ms
- For tasks other than completion (test generation, documentation), larger models are used because these aren’t as time-sensitive
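Purely as an illustration of that tradeoff, routing by latency sensitivity might look like the following; the budgets, task names, and model tiers here are made up:

```python
# Hypothetical latency budgets per task; the numbers and model names
# are illustrative assumptions, not Copilot's actual configuration.
TASK_BUDGETS_MS = {
    "completion": 200,        # interactive, must feel instant
    "test_generation": 5000,  # runs in the background
    "documentation": 5000,
}

SMALL_MODEL, LARGE_MODEL = "small-fast-model", "large-slow-model"

def pick_model(task: str) -> str:
    """Use the small model only for tightly latency-bound tasks."""
    return SMALL_MODEL if TASK_BUDGETS_MS[task] <= 200 else LARGE_MODEL
```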
What about privacy and security concerns?
- One constraint is that Copilot can innovate based on its telemetry, but not on user code. In fact, they do not use any user code when fine-tuning.
- In addition, there are security concerns from larger companies
Do they do retrieval?
- A first version, which is currently in use, simply checks the files that are open in the editor
- They do string-similarity matching against chunks from other open files (see the sketch after this list)
- There is no neural encoder (unlike a lot of the latest work in retrieval)
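A minimal sketch of that style of retrieval, assuming fixed-size line-window chunking and token-set Jaccard similarity (both assumptions; the notes only say string-similarity matching):

```python
import re

def chunks(text: str, window: int = 10):
    """Slide a fixed-size window of lines over a file."""
    lines = text.splitlines()
    for i in range(max(len(lines) - window + 1, 1)):
        yield "\n".join(lines[i : i + window])

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity, a simple string-matching score."""
    ta = set(re.findall(r"\w+", a))
    tb = set(re.findall(r"\w+", b))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve(query: str, open_files: dict, k: int = 3):
    """Return the k chunks from other open files most similar to the
    code around the cursor (`query`). No neural encoder involved;
    `open_files` maps file path -> file contents."""
    scored = [
        (jaccard(query, chunk), path, chunk)
        for path, text in open_files.items()
        for chunk in chunks(text)
    ]
    return sorted(scored, reverse=True)[:k]
```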
Could we do some reprompting based on errors?
- This is hard online, because Copilot has to produce completions fast
- However, there is a lot of offline evaluation of this
- For example, they have done offline evaluation to try to get generated code to pass unit tests (see the sketch after this list)
- However, running arbitrary code in the wild is scary!
- There is also work based on guessing execution info without executing the code which they hope to incorporate
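A sketch of such an offline evaluate-and-reprompt loop; `generate` is a hypothetical stand-in for the model call, and running candidates in a subprocess with a timeout is one crude hedge against the "running arbitrary code is scary" problem:

```python
import subprocess
import sys
import tempfile

def generate(prompt):
    """Hypothetical stand-in for a model call; not a real API."""
    raise NotImplementedError

def run_tests(candidate, test_code):
    """Execute candidate + tests in a subprocess with a timeout.
    Even contained like this, running generated code is risky, which
    is why this kind of evaluation happens offline, not in the editor."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=10)
        return proc.returncode == 0, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "timeout"

def reprompt_until_pass(prompt, test_code, rounds=3):
    """Feed failures back into the prompt and retry. Far too slow for
    interactive completion, but feasible in offline evaluation."""
    for _ in range(rounds):
        candidate = generate(prompt)
        ok, err = run_tests(candidate, test_code)
        if ok:
            return candidate
        prompt += f"\n# The previous attempt failed with:\n# {err}\n"
    return None
```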
Will there be future features like editing, repair, etc?
- Yes! Check out Copilot Labs and Code Brushes
Will Copilot be able to learn your personalized coding style?
- Code Brushes supports a feature where you can define a custom prompt that gets fed into the language model
- If you have a lot of typos in your code, it will likely produce more typos (because of the nature of prompting)
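A toy sketch of how such a custom prompt might be assembled (the template is a guess at how the feature could work, not Code Brushes’ actual format):

```python
def apply_brush(code: str, custom_instruction: str) -> str:
    """Assemble a prompt asking the model to rewrite `code` following
    a user-defined instruction. Illustrative only."""
    return (
        f"# Instruction: {custom_instruction}\n"
        f"# Original code:\n{code}\n"
        f"# Rewritten code:\n"
    )

# Example: a personal-style brush. Whatever appears in the prompt,
# typos included, shapes what the model imitates.
prompt = apply_brush("def add(a,b):return a+b",
                     "Reformat to my style: spaces around operators")
```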